An Efficient Phrase-to-Phrase Alignment Model for Arbitrarily Long Phrase and Large Corpora
نویسندگان
چکیده
Most statistical machine translation (SMT) systems use phrase-to-phrase translations to capture local context information, leading to better lexical choices and more reliable word reordering. Long phrases capture more contexts than short phrases and result in better translation qualities. On the other hand, the increasing amount of bilingual data poses serious problems for storing all possible phrases. In this paper, we describe a novel phrase-to-phrase alignment model which allows for arbitrarily long phrases and works for very large bilingual corpora. This model is very efficient in both time and space and the resulting translations are better than the state-of-the-art systems.
منابع مشابه
A Phrase-Based HMM Approach to Document/Abstract Alignment
We describe a model for creating word-to-word and phrase-to-phrase alignments between documents and their human written abstracts. Such alignments are critical for the development of statistical summarization systems that can be trained on large corpora of document/abstract pairs. Our model, which is based on a novel Phrase-Based HMM, outperforms both the Cut & Paste alignment model (Jing, 2002...
متن کاملStatistical machine translation with cascaded probabilistic transducers
Statistical machine translation is based on the idea to extract information from bilingual corpora, which can be used to generate new translations. The current work combines aspects from example-based machine translation and from grammar-based approaches, esp. bilingual regular grammars, to develop a statistical translation system based on cascaded transducers. These transducers can be construc...
متن کاملمدل ترجمه عبارت-مرزی با استفاده از برچسبهای کمعمق نحوی
Phrase-boundary model for statistical machine translation labels the rules with classes of boundary words on the target side phrases of training corpus. In this paper, we extend the phrase-boundary model using shallow syntactic labels including POS tags and chunk labels. With the priority of chunk labels, the proposed model names non-terminals with shallow syntactic labels on the boundaries of ...
متن کاملPhrase alignment confidence for statistical machine translation
The performance of phrase-based statistical machine translation (SMT) systems is crucially dependent on the quality of the extracted phrase pairs, which is in turn a function of word alignment quality. Data sparsity, an inherent problem in SMT even with large training corpora, often has an adverse impact on the reliability of the extracted phrase translation pairs. In this paper, we present a n...
متن کاملارائۀ راهکاری قاعدهمند جهت تبدیل خودکار درخت تجزیۀ نحوی وابستگی به درخت تجزیۀ نحوی ساختسازهای برای زبان فارسی
In this paper, an automatic method in converting a dependency parse tree into an equivalent phrase structure one, is introduced for the Persian language. In first step, a rule-based algorithm was designed. Then, Persian specific dependency-to-phrase structure conversion rules merged to the algorithm. Subsequently, the Persian dependency treebank with about 30,000 sentences was used as an input ...
متن کامل